perf(fastlanes): fuse bit-packed compare into a transposed mask + untranspose #8239
joseph-isaacs wants to merge 15 commits into
Add `bitpack_compare_sweep`, which exercises the public `array.binary(rhs, op)` compare-against-constant path over all eight integer types and every valid bit width (64Ki in-range elements per case, no patches). It isolates the `<BitPacked as CompareKernel>` unpack + per-element compare kernel so a kernel change shows up as a CodSpeed diff. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Merging this PR will improve performance by 59.63%
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | Simulation | compare[1] | 170.6 µs | 58.5 µs | ×2.9 |
| ⚡ | Simulation | compare[2] | 171.9 µs | 59.7 µs | ×2.9 |
| ⚡ | Simulation | compare[1] | 172.8 µs | 60.2 µs | ×2.9 |
| ⚡ | Simulation | compare[1] | 163.3 µs | 57.2 µs | ×2.9 |
| ⚡ | Simulation | compare[1] | 165.9 µs | 58.7 µs | ×2.8 |
| ⚡ | Simulation | compare[2] | 165 µs | 58.9 µs | ×2.8 |
| ⚡ | Simulation | compare[2] | 167.8 µs | 60.5 µs | ×2.8 |
| ⚡ | Simulation | compare[2] | 176.3 µs | 63.8 µs | ×2.8 |
| ⚡ | Simulation | compare[3] | 178.6 µs | 66.8 µs | ×2.7 |
| ⚡ | Simulation | compare[3] | 180.7 µs | 68.2 µs | ×2.7 |
| ⚡ | Simulation | compare[3] | 171.7 µs | 65.5 µs | ×2.6 |
| ⚡ | Simulation | compare[4] | 181.6 µs | 69.3 µs | ×2.6 |
| ⚡ | Simulation | compare[3] | 174.1 µs | 66.9 µs | ×2.6 |
| ⚡ | Simulation | compare[4] | 183.9 µs | 71.3 µs | ×2.6 |
| ⚡ | Simulation | compare[4] | 174.6 µs | 68.5 µs | ×2.5 |
| ⚡ | Simulation | compare[4] | 177.2 µs | 69.9 µs | ×2.5 |
| ⚡ | Simulation | compare[5] | 187.1 µs | 75.2 µs | ×2.5 |
| ⚡ | Simulation | compare[5] | 188.9 µs | 76.2 µs | ×2.5 |
| ⚡ | Simulation | compare[5] | 182.2 µs | 74.9 µs | ×2.4 |
| ⚡ | Simulation | compare[6] | 190.9 µs | 78.6 µs | ×2.4 |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/confident-hamilton-mZIEo (bd3fbaa) with develop (583b003)
…ranspose

Replace the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused `unpack_cmp`: compare each value as it is unpacked, accumulating results straight into a transposed 1024-bit mask (`[u64; 16]`, one register-resident word per lane - no `[bool; 1024]`/`[T; 1024]` scratch), then a single SIMD `untranspose_bits` per block rotates the mask into logical row order, copied directly into the output bit buffer. Inline patches are spliced in afterwards; sliced (offset != 0) arrays fall back to the scalar streaming predicate.

This requires the in-development FastLanes (PR #141 fused mask + PR #145 width-generic BMI2/VBMI untranspose), pinned via a git patch until released.

Benchmarked end-to-end through the public compare path (`bitpack_compare_sweep`, 64Ki elements, all integer types and bit widths): fused beats the streaming baseline for every type and width -

  i8/u8    ~6.2-7.7x
  i16/u16  ~4.5-6.0x
  i32/u32  ~1.9-4.3x
  i64/u64  ~1.2-1.9x

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
e27f5f4 to 48da899
…space wasm-test is excluded from the workspace, so it does not inherit the root [patch.crates-io] and was building vortex-fastlanes against published fastlanes 0.5.0 (old `[bool;1024]` unpack_cmp, no `untranspose_bits`) -> compile error in compare_fused.rs. Add the matching git `rev` pin here. Temporary, like the root pin: both are removed when a FastLanes release is cut and the version is bumped. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Benchmarks: PolarSignals Profiling
Vortex (geomean): 0.959x ➖
datafusion / vortex-file-compressed (0.959x ➖, 1↑ 0↓)
No file size changes detected.

Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.003x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.990x ➖, 0↑ 0↓)
datafusion / parquet (0.997x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.976x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.975x ➖, 0↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)

Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.007x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.006x ➖, 0↑ 0↓)
datafusion / parquet (0.994x ➖, 1↑ 1↓)
datafusion / arrow (1.010x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.003x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.998x ➖, 0↑ 0↓)
duckdb / parquet (1.009x ➖, 1↑ 1↓)
duckdb / duckdb (1.000x ➖, 0↑ 0↓)
File Size Changes (10 files changed, -0.1% overall, 4↑ 6↓)

Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.015x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.012x ➖, 0↑ 3↓)
datafusion / parquet (0.999x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (1.014x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.007x ➖, 3↑ 2↓)
duckdb / parquet (1.008x ➖, 0↑ 0↓)
duckdb / duckdb (1.017x ➖, 1↑ 2↓)
File Size Changes (6 files changed, -0.0% overall, 2↑ 4↓)

Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.363x ❌, 0↑ 3↓)
datafusion / vortex-compact (1.062x ➖, 0↑ 1↓)
datafusion / parquet (1.028x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.076x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.992x ➖, 0↑ 0↓)
duckdb / parquet (1.035x ➖, 0↑ 0↓)

Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (1.002x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.041x ➖, 0↑ 0↓)
duckdb / parquet (1.006x ➖, 0↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)

Benchmarks: Random Access
Vortex (geomean): 0.858x ✅
unknown / unknown (0.918x ➖, 16↑ 0↓)

Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.073x ➖, 0↑ 3↓)
datafusion / vortex-compact (1.078x ➖, 0↑ 2↓)
datafusion / parquet (1.065x ➖, 0↑ 1↓)
datafusion / arrow (1.086x ➖, 0↑ 6↓)
duckdb / vortex-file-compressed (1.069x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.062x ➖, 0↑ 1↓)
duckdb / parquet (1.036x ➖, 0↑ 0↓)
duckdb / duckdb (1.036x ➖, 0↑ 0↓)
File Size Changes (27 files changed, +0.0% overall, 13↑ 14↓)

Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.997x ➖, 1↑ 0↓)
datafusion / parquet (0.978x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.980x ➖, 3↑ 0↓)
duckdb / parquet (0.996x ➖, 0↑ 0↓)
duckdb / duckdb (1.003x ➖, 0↑ 1↓)
File Size Changes (104 files changed, +0.0% overall, 57↑ 47↓)

Benchmarks: Appian on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
datafusion / parquet (0.992x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.997x ➖, 0↑ 0↓)
duckdb / parquet (0.996x ➖, 0↑ 0↓)
duckdb / duckdb (0.996x ➖, 0↑ 0↓)
File Size Changes (3 files changed, -0.0% overall, 2↑ 1↓)

Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.045x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.205x ➖, 0↑ 10↓)
datafusion / parquet (1.051x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.075x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.000x ➖, 0↑ 2↓)
duckdb / parquet (1.037x ➖, 0↑ 1↓)

let mut words: BufferMut<u64> = BufferMut::zeroed(num_chunks * WORDS_PER_CHUNK);

let chunks = array.unpacked_chunks::<T>()?;
{

/// new value. Avoids a data-dependent branch per patch in the patch-fixup loop, and touches the
/// target word through a single bounds-checked `&mut`.
#[inline]
fn set_bit(words: &mut [u64], idx: usize, value: bool) {
we should have set bit already?
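For context on the hunk under review: the body of `set_bit` is not shown in the quoted diff, so the following is only a minimal sketch of what a branch-free bit write could look like. It assumes LSB-first bit order within each `u64` word, and the names `set_bit_sketch` and `get_bit_sketch` are hypothetical, not taken from the PR.

```rust
/// Hypothetical stand-in for the helper quoted above: write `value` into bit `idx`
/// of a `u64` bit buffer without branching on `value`.
/// Assumption: bit `idx` lives in word `idx / 64` at position `idx % 64` (LSB first).
#[inline]
fn set_bit_sketch(words: &mut [u64], idx: usize, value: bool) {
    // One bounds-checked access; the word is then updated through this `&mut`.
    let word = &mut words[idx / 64];
    let shift = idx % 64;
    // Clear the target bit, then OR in the new value (0 or 1) at that position.
    *word = (*word & !(1u64 << shift)) | ((value as u64) << shift);
}

/// Hypothetical read-side helper, used only to check the sketch.
#[inline]
fn get_bit_sketch(words: &[u64], idx: usize) -> bool {
    (words[idx / 64] >> (idx % 64)) & 1 == 1
}
```

Because the write is a clear followed by an OR, setting a bit to `false` and setting it to `true` run the same instructions, which is the property the doc comment describes for the patch-fixup loop.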

Benchmarks: Compression
Vortex (geomean): 0.986x ➖
unknown / unknown (0.958x ➖, 24↑ 2↓)

Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.143x ➖, 0↑ 5↓)
datafusion / vortex-compact (0.902x ➖, 1↑ 1↓)
datafusion / parquet (1.165x ➖, 0↑ 4↓)
duckdb / vortex-file-compressed (0.970x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.927x ➖, 1↑ 0↓)
duckdb / parquet (0.945x ➖, 0↑ 0↓)
…ernel ArrayRef::slice on a patched BitPackedArray leaves a lazy SliceArray (the buffer-free SliceReduce path bails when patches are present), so as_::<BitPacked>() panicked before the fused compare kernel ran. Acquire the sliced BitPacked through SliceKernel, which reads the buffers and produces a sliced BitPacked with sliced patches, so the test exercises the fused unpack_cmp + patch-fixup path it was written for. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
…te-paths Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Summary
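Replaces the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused `unpack_cmp`:
- each value is compared as it is unpacked, with results accumulated straight into a transposed 1024-bit mask (`[u64; 16]`, one register-resident word per lane; no `[bool; 1024]`/`[T; 1024]` scratch),
- a single SIMD `untranspose_bits` per block rotates the mask into logical row order, copied directly into the output bit buffer,
- inline patches are spliced in afterwards; sliced (`offset != 0`) arrays fall back to the scalar streaming predicate.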
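
The fused shape described in this summary can be illustrated with a scalar sketch. This is not the FastLanes kernel: it assumes plain sequential LSB-first packing and a simplified transposed layout in which logical element `i` lands in lane `i % 16` at bit `i / 16`, whereas the real library uses its own interleaved order and SIMD. All function names below (`pack`, `unpack_one`, `unpack_cmp_sketch`, `untranspose_bits_sketch`) are hypothetical.

```rust
const BLOCK: usize = 1024; // elements per block
const LANES: usize = 16;   // one u64 mask word per lane, 16 * 64 = 1024 bits

/// Pack `values` (each assumed to fit in `width` bits, 1..=64) sequentially, LSB first.
fn pack(values: &[u64], width: usize) -> Vec<u64> {
    let mut out = vec![0u64; (values.len() * width).div_ceil(64)];
    for (i, &v) in values.iter().enumerate() {
        let (word, shift) = (i * width / 64, i * width % 64);
        out[word] |= v << shift;
        if shift + width > 64 {
            out[word + 1] |= v >> (64 - shift); // value straddles two words
        }
    }
    out
}

/// Read the `idx`-th `width`-bit value back out of the packed buffer.
fn unpack_one(packed: &[u64], width: usize, idx: usize) -> u64 {
    let (word, shift) = (idx * width / 64, idx * width % 64);
    let mask = if width == 64 { u64::MAX } else { (1u64 << width) - 1 };
    let mut v = packed[word] >> shift;
    if shift + width > 64 {
        v |= packed[word + 1] << (64 - shift);
    }
    v & mask
}

/// Fused step: compare each value as it is unpacked and accumulate the result
/// directly into the transposed mask, with no `[bool; 1024]` or `[T; 1024]` scratch.
fn unpack_cmp_sketch(packed: &[u64], width: usize, pred: impl Fn(u64) -> bool) -> [u64; LANES] {
    let mut mask = [0u64; LANES];
    for row in 0..BLOCK / LANES {
        for lane in 0..LANES {
            let v = unpack_one(packed, width, row * LANES + lane);
            mask[lane] |= (pred(v) as u64) << row;
        }
    }
    mask
}

/// Rotate the transposed mask into logical row order (bit `i` of the output
/// is the comparison result for element `i`).
fn untranspose_bits_sketch(mask: &[u64; LANES]) -> [u64; LANES] {
    let mut out = [0u64; LANES];
    for i in 0..BLOCK {
        let bit = (mask[i % LANES] >> (i / LANES)) & 1;
        out[i / 64] |= bit << (i % 64);
    }
    out
}
```

In this sketch the untranspose is a per-bit loop for clarity; the point of the design in the PR is that the mask stays in 16 words while the block is processed and is reordered once per block rather than once per element.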